
Merge summaries #1

Open · wants to merge 8 commits into base: qemu

Conversation

tonyhutter (Owner) commented:

Just testing, please ignore


mcmilk commented Jun 21, 2024

Thank you a lot, I will use it.


mcmilk commented Jun 21, 2024

The FreeBSD images now have a repository: https://github.com/mcmilk/openzfs-freebsd-images/releases


mcmilk commented Jun 25, 2024

FreeBSD 13 has problems with the virtio NIC. Just use the e1000 NIC, like I have done here: https://github.com/mcmilk/zfs/tree/qemu-machines2
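
For reference, a minimal sketch of how the NIC model can be chosen on the QEMU command line (illustrative only; the actual invocation used in the workflow may differ):

# pick the e1000 NIC model instead of model=virtio-net-pci
qemu-system-x86_64 -m 512 -nographic \
    -nic user,model=e1000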


mcmilk commented Jun 25, 2024

> FreeBSD 13 has problems with the virtio NIC. Just use the e1000 NIC, like I have done here: https://github.com/mcmilk/zfs/tree/qemu-machines2

Ah, some other problem :(

tonyhutter force-pushed the qemu2 branch 3 times, most recently from 136cf60 to 73f5c57 on June 25, 2024 at 20:42
mcmilk and others added 3 commits August 5, 2024 16:17
The timezone "US/Mountain" isn't supported on newer Linux versions.
Using the canonical timezone "America/Denver", as is already done for
FreeBSD, fixes this. Older Linux distros also handle it fine.

Signed-off-by: Tino Reichardt <milky-zfs@mcmilk.de>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
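
As a quick sanity check for the timezone change above, a minimal sketch (the zoneinfo path is the usual location, but treat the exact paths as assumptions for any particular image):

# both Linux and FreeBSD ship the canonical IANA name; the legacy "US/*"
# aliases may be missing on newer minimal images
test -f /usr/share/zoneinfo/America/Denver && echo "America/Denver present"
TZ="America/Denver" date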
This test was failing before:
- FAIL cli_root/zfs_copies/zfs_copies_006_pos (expected PASS)

Signed-off-by: Tino Reichardt <milky-zfs@mcmilk.de>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
This includes the last 12.x release (now EOL) and 13.0 development
versions (<1300139).

Sponsored-by: https://despairlabs.com/sponsor/

Signed-off-by: Rob Norris <robn@despairlabs.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
tonyhutter (Owner, Author) commented:

@mcmilk the zpool_status_008_pos failures are just timing. This fixes it:

diff --git a/tests/zfs-tests/tests/functional/cli_root/zpool_status/zpool_status_008_pos.ksh b/tests/zfs-tests/tests/functional/cli_root/zpool_status/zpool_status_008_pos.ksh
index 6be2ad5a7..70f480cbb 100755
--- a/tests/zfs-tests/tests/functional/cli_root/zpool_status/zpool_status_008_pos.ksh
+++ b/tests/zfs-tests/tests/functional/cli_root/zpool_status/zpool_status_008_pos.ksh
@@ -69,12 +69,12 @@ for raid_type in "draid2:3d:6c:1s" "raidz2"; do
        log_mustnot eval "zpool status -e $TESTPOOL2 | grep ONLINE"
 
        # Check no ONLINE slow vdevs are show.  Then mark IOs greater than
-       # 10ms slow, delay IOs 20ms to vdev6, check slow IOs.
+       # 40ms slow, delay IOs 80ms to vdev6, check slow IOs.
        log_must check_vdev_state $TESTPOOL2 $TESTDIR/vdev6 "ONLINE"
        log_mustnot eval "zpool status -es $TESTPOOL2 | grep ONLINE"
 
-       log_must set_tunable64 ZIO_SLOW_IO_MS 10
-       log_must zinject -d $TESTDIR/vdev6 -D20:100 $TESTPOOL2
+       log_must set_tunable64 ZIO_SLOW_IO_MS 40
+       log_must zinject -d $TESTDIR/vdev6 -D80:100 $TESTPOOL2
        log_must mkfile 1048576 /$TESTPOOL2/testfile
        sync_pool $TESTPOOL2
        log_must set_tunable64 ZIO_SLOW_IO_MS $OLD_SLOW_IO

I'm still trying to figure out why raidz_expand_001_pos.ksh is reporting errors. That seems to be the last test that is failing on QEMU.

This commit adds functional tests for these systems:
- AlmaLinux 8, AlmaLinux 9
- ArchLinux
- CentOS Stream 9
- Fedora 39, Fedora 40
- Debian 11, Debian 12
- FreeBSD 13, FreeBSD 14, FreeBSD 15
- Ubuntu 20.04, Ubuntu 22.04, Ubuntu 24.04

Workflow for each operating system:
- install QEMU on the GitHub runner
- download the current cloud image
- start and init that image via cloud-init (sketched below)
- install dependencies and power off the system
- start the system, build OpenZFS, then power off again
- clone the system and start QEMU workers for parallel testing
- run the functional tests, hopefully in < 3h

Signed-off-by: Tino Reichardt <milky-zfs@mcmilk.de>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
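
The image boot steps above look roughly like the following. This is only an illustrative sketch under assumed image names, sizes, and port numbers, not the actual workflow code (which lives in the commit itself):

# download a current cloud image (URL/name assumed for illustration)
wget https://cloud-images.ubuntu.com/releases/24.04/release/ubuntu-24.04-server-cloudimg-amd64.img

# build a cloud-init seed from user-data/meta-data (cloud-image-utils)
cloud-localds seed.iso user-data meta-data

# boot the image with the cloud-init seed attached; e1000 NIC as discussed above
qemu-system-x86_64 -machine q35,accel=kvm -m 4G -smp 4 \
    -drive file=ubuntu-24.04-server-cloudimg-amd64.img,if=virtio,format=qcow2 \
    -drive file=seed.iso,media=cdrom,format=raw \
    -nic user,model=e1000,hostfwd=tcp::2222-:22 \
    -nographic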

mcmilk commented Aug 6, 2024

20240806T173422/results:Test: /usr/share/zfs/zfs-tests/tests/functional/raidz/raidz_expand_001_pos.ksh (run as root) [01:15] [PASS]
20240806T173539/results:Test: /usr/share/zfs/zfs-tests/tests/functional/raidz/raidz_expand_001_pos.ksh (run as root) [00:53] [PASS]
20240806T173633/results:Test: /usr/share/zfs/zfs-tests/tests/functional/raidz/raidz_expand_001_pos.ksh (run as root) [01:19] [PASS]
20240806T173755/results:Test: /usr/share/zfs/zfs-tests/tests/functional/raidz/raidz_expand_001_pos.ksh (run as root) [03:14] [PASS]
20240806T174243/results:Test: /usr/share/zfs/zfs-tests/tests/functional/raidz/raidz_expand_001_pos.ksh (run as root) [02:19] [PASS]
20240806T174503/results:Test: /usr/share/zfs/zfs-tests/tests/functional/raidz/raidz_expand_001_pos.ksh (run as root) [00:55] [PASS]
[root@vm1 test_results]# grep -r ".*raidz_expand_001_pos.ksh.*\[FAIL\]$"|grep result
20240806T170150/results:Test: /usr/share/zfs/zfs-tests/tests/functional/raidz/raidz_expand_001_pos.ksh (run as root) [02:55] [FAIL]
20240806T170812/results:Test: /usr/share/zfs/zfs-tests/tests/functional/raidz/raidz_expand_001_pos.ksh (run as root) [02:57] [FAIL]
20240806T172246/results:Test: /usr/share/zfs/zfs-tests/tests/functional/raidz/raidz_expand_001_pos.ksh (run as root) [03:04] [FAIL]
20240806T173017/results:Test: /usr/share/zfs/zfs-tests/tests/functional/raidz/raidz_expand_001_pos.ksh (run as root) [03:02] [FAIL]
20240806T174111/results:Test: /usr/share/zfs/zfs-tests/tests/functional/raidz/raidz_expand_001_pos.ksh (run as root) [01:29] [FAIL]
[root@vm1 test_results]# grep -r ".*raidz_expand_001_pos.ksh.*\[FAIL\]$"|grep result |wc -l
5
[root@vm1 test_results]# grep -r ".*raidz_expand_001_pos.ksh.*\[PASS\]$"|grep result |wc -l
40

The error comes from the command zpool scrub -w testpool within the ksh function test_scrub:

SUCCESS: zpool import -o cachefile=none -d /var/tmp testpool
SUCCESS: zpool scrub -w testpool
SUCCESS: zpool clear testpool
SUCCESS: zpool export testpool
124+0 records in
124+0 records out
130023424 bytes (130 MB, 124 MiB) copied, 0.295698 s, 440 MB/s
124+0 records in
124+0 records out
130023424 bytes (130 MB, 124 MiB) copied, 0.266317 s, 488 MB/s
124+0 records in
124+0 records out
130023424 bytes (130 MB, 124 MiB) copied, 0.198369 s, 655 MB/s
SUCCESS: zpool import -o cachefile=none -d /var/tmp testpool
SUCCESS: zpool scrub -w testpool
ERROR: check_pool_status testpool errors No known data errors exited 1
NOTE: Performing test-fail callback (/usr/share/zfs/zfs-tests/callbacks/zfs_dbgmsg.ksh)

I think the problem is that zpool scrub -w tries to start a scrub on a pool that is already scrubbing.

Option 1: check the status first, and if a scrub has already been started, just wait for it
Option 2: always stop any running scrub via zpool scrub -s and ignore its exit status

What would you prefer?
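
Both options are only a few lines; a rough sketch of each, using the is_pool_scrubbing / wait_scrubbed helpers that already exist in the test library (illustrative only):

# Option 1: if a scrub is already running, wait for it to finish first
if is_pool_scrubbing $pool; then
        wait_scrubbed $pool
fi
log_must zpool scrub -w $pool

# Option 2: stop any in-flight scrub and ignore its exit status
zpool scrub -s $pool 2>/dev/null
log_must zpool scrub -w $pool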

tonyhutter (Owner, Author) commented:

@mcmilk I'm currently testing with this:

diff --git a/tests/zfs-tests/tests/functional/raidz/raidz_expand_001_pos.ksh b/tests/zfs-tests/tests/functional/raidz/raidz_expand_001_pos.ksh
index 063d7fa73..167f39cfc 100755
--- a/tests/zfs-tests/tests/functional/raidz/raidz_expand_001_pos.ksh
+++ b/tests/zfs-tests/tests/functional/raidz/raidz_expand_001_pos.ksh
@@ -153,8 +153,12 @@ function test_scrub # <pool> <parity> <dir>
        done
 
        log_must zpool import -o cachefile=none -d $dir $pool
+       if is_pool_scrubbing $pool ; then
+               wait_scrubbed $pool
+       fi
 
        log_must zpool scrub -w $pool
+
        log_must zpool clear $pool
        log_must zpool export $pool
 
@@ -165,7 +169,9 @@ function test_scrub # <pool> <parity> <dir>
        done
 
        log_must zpool import -o cachefile=none -d $dir $pool
-
+       if is_pool_scrubbing $pool ; then
+               wait_scrubbed $pool
+       fi
        log_must zpool scrub -w $pool
 
        log_must check_pool_status $pool "errors" "No known data errors"
diff --git a/tests/zfs-tests/tests/functional/raidz/raidz_expand_002_pos.ksh b/tests/zfs-tests/tests/functional/raidz/raidz_expand_002_pos.ksh
index 004f3d1f9..e416926d1 100755
--- a/tests/zfs-tests/tests/functional/raidz/raidz_expand_002_pos.ksh
+++ b/tests/zfs-tests/tests/functional/raidz/raidz_expand_002_pos.ksh
@@ -105,6 +105,10 @@ for disk in ${disks[$(($nparity+2))..$devs]}; do
                log_fail "pool $pool not expanded"
        fi
 
+       # It's possible the pool could be auto scrubbing here.  If so, wait.
+       if is_pool_scrubbing $pool ; then
+               wait_scrubbed $pool
+       fi
        verify_pool $pool
 
        pool_size=$expand_size

I think that might help some of the raidz_expand_001_pos failures, but not eliminate them completely. There may actually be a legitimate problem that's just being exposed, but I'm not sure yet. I want to get some more test runs done to see.

Also, I tweaked my commit a little to add more time in zpool_status_008_pos and fix a rare timing bug in crtime_001_pos.


mcmilk commented Aug 6, 2024

> @mcmilk I'm currently testing with this: [the raidz_expand diffs quoted from the comment above] […] Also, I tweaked my commit a little to add more time in zpool_status_008_pos and fix a rare timing bug in crtime_001_pos.

I will run this in a loop; I think 50 times should be a good start.
Only these two tests, via -t raidz_expand_001_pos and -t raidz_expand_002_pos.

It's running with the -I 55 option:
https://github.com/mcmilk/zfs/actions/runs/10272187541
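
For reference, such a run could look something like this; the option spellings are taken from the comment above and are assumptions (in particular, the -t argument may need the full test path and repeated -t may not be accepted by every zfs-tests.sh version):

# illustrative only: repeat the two raidz_expand tests many times
./scripts/zfs-tests.sh -I 55 \
    -t raidz_expand_001_pos -t raidz_expand_002_pos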


mcmilk commented Aug 6, 2024

Hm, the raidz_expand_001_pos tests are still failing, even with the wait-for-scrub change :(

I used this: is_pool_scrubbing $pool && wait_scrubbed $pool
Diff is here: openzfs@02a14e7


mcmilk commented Aug 6, 2024

AlmaLinux 8 and Ubuntu 20.04 are fine.
Maybe it's bclone related, since that is disabled on the older kernels?


mcmilk commented Aug 12, 2024

Some notes on the raidz_expand_001_pos testing problem:

  • I limited the code to run only the scalar implementation (to exclude possible assembly failures)
  • FreeBSD 13/14 does not have this issue at all; all tests run fine in around 2m 30s
  • FreeBSD 15 does not have this issue at all; all tests run fine in around 4m
  • on Linux 5.4: timings are around 3m and the failing rate is about 1/120
  • on Linux 6.x: timings are around 5m 30s and the failing rate is about 1/2, maybe 1/3

Special test run with only raidz_expand_001_pos: https://github.com/mcmilk/zfs/actions/runs/10346648527

So the raidz code is maybe okay... but some SPL thing?
Should we run against zfs-2.2.5?
